Rethinking The Corpus: Moving towards Dynamic Linguistic Resources

Author

  • Andrew Rosenberg
Abstract

The corpus is an invaluable resource in Spoken and Natural Language Processing. Consistent data sets have allowed for empirical evaluation of competing algorithms. The sharing of high-quality annotated linguistic data has enabled participation and experimentation by a wide range of researchers. However, although these annotations are dubbed "gold-standard", many corpora contain labeling errors and idiosyncrasies. The current view of the corpus as a static resource makes correcting errors and making other modifications prohibitively difficult. This paper advances a view of the corpus as dynamically changing. We highlight the problems of the static view through case studies of the Penn Treebank, Switchboard, Hub-4 and the Boston University Radio News Corpus. We propose the use of version control software as a mechanism to facilitate this dynamic view.

Index Terms: Linguistic Resources, Opinion paper.

Similar resources

Towards International Standards for Language Resources

This paper describes the Linguistic Annotation Framework (LAF) developed by the International Standards Organization TC37 SC4, which is to serve as a basis for harmonizing existing language resources as well as developing new ones. We then describe the use of the LAF to represent the American National Corpus and its linguistic annotations.

Towards Open Data for Linguistics: Linguistic Linked Data

‘Open Data’ has become very important in a wide number of fields. However, for Linguistics, much data is still published in closed formats and is not made available on the web. We propose the use of linked data principles to enable language resources to be published and interlinked openly on the web, and describe the application of this paradigm to the modeling of two language resources, WordNet a...

Agile Corpus Annotation in Practice: An Overview of Manual and Automatic Annotation of CVs

This paper describes work testing agile data annotation by moving away from the traditional, linear phases of corpus creation towards iterative ones and by recognizing the potential for sources of error occurring throughout the annotation process.

A Contrastive Study of Metadiscourse in English and Persian Editorials

The original impetus for this cross-linguistic study came from a need to explore the effect of cultural factors and generic conventions on the use and distribution of metadiscourse within a single genre. To this end, the study, as a piece of contrastive rhetoric research, examined a corpus of 60 newspaper editorials (written in English and Persian) culled from 10 elite newspapers in America and Iran. Bas...

Allophone-based acoustic modeling for Persian phoneme recognition

Phoneme recognition is one of the fundamental phases of automatic speech recognition. Coarticulation, which refers to the integration of sounds, is one of the important obstacles in phoneme recognition. In other words, each phone is influenced and changed by the characteristics of its neighboring phones, and coarticulation is responsible for most of these changes. The idea of modeling the effects o...

Publication date: 2012